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ABSTRACT 

Motivation: Understanding exact structure of an individual's haplo- 
type plays a significant role in various fields of human genetics. 
Despite tremendous research effort in recent years, fast and accurate 
haplotype reconstruction remains as an active research topic, mainly 
owing to the computational challenges involved. Existing haplotype 
assembly algorithms focus primarily on improving accuracy of the 
assembly, making them computationally challenging for applications 
on large high-throughput sequence data. Therefore, there is a need to 
develop haplotype reconstruction algorithms that are not only accur- 
ate but also highly scalable. 

Results: In this article, we introduce FastHap, a fast and accurate 
haplotype reconstruction approach, which is up to one order of mag- 
nitude faster than the state-of-the-art haplotype inference algorithms 
while also delivering higher accuracy than these algorithms. FastHap 
leverages a new similarity metric that allows us to precisely measure 
distances between pairs of fragments. The distance is then used in 
building the fuzzy conflict graphs of fragments. Given that optimal 
haplotype reconstruction based on minimum error correction is 
known to be NP-hard, we use our fuzzy conflict graphs to develop a 
fast heuristic for fragment partitioning and haplotype reconstruction. 
Availability: An implementation of FastHap is available for sharing on 
request. 

Contact: sepideh@cs.ucla.edu 



1 INTRODUCTION 

All diploid organisms have two homologous copies of each 
chromosome, one inherited from each parent. The two DNA 
sequences of a homologous chromosome pair are usually not 
identical to each other. The most common DNA sequence vari- 
ants are single nucleotide polymorphism (SNP). We refer to the 
sites at which the two DNA sequences differ as heterozygous 
sites. Current high-throughput sequencing technologies (Eid 
et ah, 2009) are incapable of reading the DNA sequence of an 
entire chromosome. Instead, they produce a huge collection of 
short reads of DNA fragments. The process of inferring two 
DNA sequences (i.e. haplotypes) from a set of reads is referred 
to as haplotype assembly, which has become a crucial computa- 
tional task to reconstruct one's genome from these reads. 

Haplotype assembly methods usually involve three main 
stages before reconstruction phase. First, a sequence aligner is 
used to align the reads to the reference genome. Then, only the 
read alignments at the heterozygous sites are kept for haplotype 
reconstruction. Last, reads that span multiple heterozygous sites 
are used to infer the alleles belonging to each haplotype. 
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The quality of the reconstructed haplotypes may be dramatically 
affected by errors in sequencing and alignment. The objective, 
therefore, is to design algorithms that mitigate this impact and 
rebuild the most likely copies of each chromosome accurately. 
This has led to development of accurate haplotype reconstruc- 
tion algorithms in the past few years. We are, however, observing 
a critical shift in sequencing technology where larger datasets 
with longer reads and higher coverage become available. This 
shift necessitates the development of algorithms that not only 
reconstruct haplotypes accurately but also require low computa- 
tion time and can scale to large datasets. In this article, we intro- 
duce a new framework for fast and accurate haplotype assembly. 

1.1 Motivation and related work 

The past decade has witnessed much research effort on enhan- 
cing accuracy of haplotype assembly methods. The research, 
however, lacks a method that is not only accurate but also fast 
enough that can be used widely on large-scale datasets. In 
particular, current trends in sequencing technologies demon- 
strate that the sequence read lengths are being extended signifi- 
cantly and access to reads of up to several thousand base pair 
long will become a reaUty in near future. 

Haplotype assembly approaches can be divided into two cate- 
gories: (i) fragment partitioning; (ii) SNP partitioning. The frag- 
ment partitioning techniques partition the set of fragments into 
two disjoint sets each representing one copy of the haplotype. 
Examples of such techniques are FastHare (Panconesi and Sozio, 
2004) and the greedy heuristic in (Levy et al, 2007). The SNP 
partitioning approaches such as HapCut (Bansal and Bafna, 
2008), HapCompass (Aguiar and Istrail, 2012) and the approach 
in (He et al., 2010) rely on partitioning the SNPs into two disjoint 
sets and finding those variants whose corresponding haplotype 
bits need to be flipped to improve minimum error correction 
(MEC). In any of the two scenarios, an iterative process is 
involved. From a computational complexity point of view, the 
main drawback with existing techniques is that they perform 
much computation during each iteration of the algoritlim. 

HapCut (Bansal et al., 2008) is an example of the algorithms 
that use SNP partitioning technique to minimize MEC criterion. 
The process involves iteratively reconstructing a weighted graph 
and finding a max-cut of the graph. Clearly, most of the 
computation occurs in a loop. The algorithm has proved to be 
fairly accurate at the cost of high computation. The greedy heur- 
istic algorithm in (Levy et ah, 2007) is a fragment partitioning 
approach. The iteration, however, involves two major computing 
tasks: (i) reconstructing a partial haplotype based on the frag- 
ments that are already assigned to a partition; (ii) calculating 
distance between unassigned fraginents and each one of the 
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haplotype copies. FastHare (Panconesi and Sozio, 2004) is an- 
other fragment partitioning algorithm. It sorts all fragments 
based on their positions before execution of the iterative 
module. Computationally intensive tasks that occur iteratively 
in FastHare include (i) reconstruction of a partial haplotype 
based on the fragments that are already assigned to a partition; 
(ii) calculating distance between the current fragment and each 
one of the two haplotype copies. 



1.2 Contributions 

Our goal in this article is to develop a framework for fast and 
accurate haplotype reconstruction. Our approach consists of 
four steps: (i) we measure dissimilarity of every pair of fragments 
using a new distance metric; (ii) we build a weighted graph, called 
fuzzy conflict graph, using the introduced dissimilarity measure; 
(iii) we use the fuzzy conflict graph to construct an initial parti- 
tion of the fragments through an iterative process; (iv) we refine 
the initial partitioning to further improve the overall MEC of the 
constructed haplotypes. More specifically, our contributions in 
this article can be summarized as follows. 

Inter-fragment distance: We introduce a new distance metric, 
called inter-fragment distance, which quantifies dissimilarity 
between pairs of fragments. This distance measure is carefully 
developed to not only assign small values to the fragments that 
match perfectly and large values to completely different frag- 
ments but also neutralize the effect of missing alleles on final 
partitioning of the fragments. 

Fuzzy conflict graph: We introduce the notion of fuzzy conflict 
graphs that are built based on the inter-fragment distances. In 
our graph model, each node represents a fragment and edge 
weights are corresponding dissimilarity measures between por- 
tions of fragments. 

Fragment partitioning algorithm: We present a two-phase com- 
putationally simple heuristic algorithm for haplotype reconstruc- 
tion. The first phase uses a fuzzy conflict graph to build an initial 
fragment partition. In the next phase, the initial partition is 
further refined to achieve additional improvements in the overall 
MEC performance of the reconstructed haplotypes. 

Validation using real data: We demonstrate the effectiveness of 
the proposed techniques using HuRef dataset, a dataset that has 
been widely used in haplotype assembly literature recently. 
Specifically, we compare our method with several previously 
published algorithms in terms of accuracy (MEC measures) 
and scalability (execution time) performance. Our results show 
that FastHap significantly outperforms the previous algoritlmis 
by providing a speedup of one order of magnitude while 
delivering comparable or better accuracies. 

Our objective is to build a fast haplotype assembly model 
where computationally intensive tasks are executed before exe- 
cution of the iterative process. Our algorithm has the following 
major differences compared with the previous work: (i) our 
dissimilarity measure is a novel distance metric that precisely 
quantifies contribution of each individual fragment for haplo- 
type assembly; (ii) we perform all distance calculations at the 
beginning of the algorithm and leave only computationally 
simple tasks to the iterative section; (iii) we perform haplotype 
reconstruction outside the iterative part of our algorithm. 



2 MATERIALS AND METHODS 

2.1 Problem statement and assumptions 

We assume that the input to the haplotype assembly algorithm is a 2D 
array containing only heterozygous sites of the aligned fragments, called 
variant matrix, of size in x «, where m denotes the number of fragments 
(aligned DNA short reads) and n represents the number of SNPs that the 
union of all fragments cover. In the following discussion, we use .v,, to 
refer to the allele of fragment / at SNP .t,/ 6 {0, 1, — ), where 0 and 1 
encode two observed alleles and - denotes that fragment/ does not cover 
the SNP site .s,. If there are more than two alleles observed at a given site, 
the two most common alleles are encoded with 0 and 1, and the remaining 
allele(s) are encoded by -. It is expected that most cells in X are filled with 
- because, in practice, each aligned fragment covers only a few SNP sites, 
limited by the fragment length (As discussed previously, the trend is that 
much longer DNA reads will be available as a result of recent techno- 
logical advancements in genome sequencing). 

Algorithm 1 FastHap high-level overview 
Initialization: 

Calculate inter-fragment distance between every pair of fragments 

(Section 2.2) 
Store inter-fragment distances in A (Section 2.2) 
Use A to construct a fuzzy conflict graph (Section 2.3) 

Piiase (I): Partitioning 

Partition fragments into two disjoint sets C\ and Cj (Section 2.4 and 
Algorithm 2) 

Phase (II): Refinement 

while (MEC score improves) do 

Find fragment / with highest MEC value 

Assign / to the opposite partition 
end while 

One of the most popular approaches for haplotype assembly is to 
construct haplotypes based on partitioning of the fragments in variant 
matrix. In this case, the haplotype assembly problem consists of two 
steps, namely fragment partitioning and fragment merging, described as 
follows. While the fragment partitioning phase aims to group rows of the 
variant matrix into two partitions, Ci and C?, fragment merging is 
intended to combine the fragments residing in each partition, through a 
SNP-wise consensus process, and form two haplotypes hi and h2 
associated with Ci and C2, respectively. The resulting haplotype is typic- 
ally denoted by H ^ {hi, /12}. The main objective of the haplotype 
assembly is, therefore, to come up with a partitioning such that the 
amount of error is minimized. Our focus in this article is on minimizing 
the MEC objective function. As mentioned previously, this problem is 
proved to be NP-hard (Cihbrasi et a!., 2005). Thus, our goal is to 
develop a heuristic algorithm for the haplotype assembly problem. Our 
solution relies on a novel inter-fragment distance measure, a graph 
inodel for inter-fragment dissimilarity assessment and a fast graph parti- 
tioning algorithm. A high-level overview of FastHap is shown in 
Algorithm I. 

2.2 Inter-fragment distance 

Given two variables x, y 6 {0, 1, — ), we define the operator (gi as follows. 
0 if; 



x^y = 



1 if .v# v&.T,.re (0,1) 
0.5 otherwise 



(1) 
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Definition 1 (Inter-fragment distance). Given a variant matrix X„,^„ 
where Xy s {0, 1, — ). hp define inter-fragment distance, A(/i,/j), between 
fragments fi = {,t,i , X/i x,,,} andfk = {x^-i , .\-a-2, . . ., Xk„} by 



mjk)=—Yyx,i®xkj) 



(2) 



where T,/, denotes the number of columns (SNPs) that are covered by 
either /i or //,, in X. In fact, Tu^ is a normalization factor that allows us to 
normalize the distance between the two fragments such that the resulting 
distance ranges from 0 to 1 (i.e. 0 < A(/i,/() < 1). 

The inter-fragment distance metric is developed with the goal of mea- 
suring the cumulative dissimilarity between each pair of fragments across 
all SNP sites. The intuition behind (1) and (2) is as follows. At a given 
SNP site Sj, if two fragments / and both cover it, the per-site distance is 
0 if they take the same allele (suggesting they may likely belong to the 
same partition) and 1 if they take opposite alleles (suggesting they may 
likely belong to different partitions). We assign 0.5 distance if the SNP 
is only covered by one of the two fragments to neutralize the contribu- 
tion of the missing element. If the SNP is not covered by either fragment, 
0 distance is cumulated at this site. An additional benefit of this 
approach is that we need to examine only SNPs covered by either 
of the two fragments. From a computing complexity point of 
view, this can reduce the execution time of the distance calculation 
significantly. 

Figure la shows a set of fragments spanning eight SNP sites. The 
resulting inter-fragment distances are shown in a symmetric distance 
matrix in Figure lb. Intuitively, A (the distance measure between two 
fragments) is smaller for those fragments that need to be grouped to- 
gether and larger for those that we prefer to be placed in different par- 
titions. When distance between the two fragments is 0.5, the two 
fragments alone do not provide sufficient information as how they 
need to be partitioned. 

Definition 2 (Pivot distance). Given a variant matrix X„,^„, the pivot 
distance between any pair of fragments in {fi,f2, ■ ■ ■ .fm] '-^ ^ = 0-5. 

The pivot X allows us to decide whether the two fragments are dissimi- 
lar enough to be placed in separate partitions. A pair of fragments with 



a) Variant Matrix X 



S6 S7 S8 



c) Fuzzy Conflict Graph (G) 



b) inter-Fragment Distance IVlatrrx A 




d) Resulting partition and iVIEC 

X = {f1 ,f2,f3,f4,f5.f6,f7,f8,f9,f1 0} 
C1 = {f1,f2,f3,f6,f10} 
C2 = {f4,f5,f7,f8,f9} 
MEC = 4 

Reconstruction rate = 100% 



Fig. 1. An example of variant matrix with 8 SNP sites (a), corresponding 
distance matrix (b), the fuzzy conflict graph associated with the variant 
matrix (c) and results of applying FastHap on the data (d). The graph in 
(c) shows only edges with non-pivot distances 



an inter-fragment distance greater than X is more likely to be placed in 
different partitions, although the final partitioning assignment is made 
after all pairs of fragments are examined through a partitioning 
algorithm. In Section 2.3, we will present a graph model that enables 
us to perform fragment partitioning by linking similar and dissimilar 
fragments through a weighted graph based on inter-fragment distance 
values. 

2.3 FastHap graph model 

In this section, we present a graph model based on the inter-fragment 
distance defined in (2). In Section 2.4, we will discuss how this graph 
model can be used to partition the fragments into two disjoint sets and 
construct haplotypes accordingly. 

Definition 3 (Fuzzy conflict graph). Given a variant matrix X composed 
of m fragments ^fufi, ■ ■ ■ .fm) spanning n SNP sites, a fuzzy conflict graph 
that models dissimilarity between pairs of fragments is a complete graph G 
represented by the tuple ( V,E, We) ■ In this graph, V ~ {1 , 2, . . . , m} is a set 
of m vertices representing the fragments in X; each edge e; is associated with 
a weight w/ equal to the distance between the corresponding fragments in X. 

The conflict graph introduced in this article, fuzzy conflict graph, is 
different from that used in previous research [e.g. the fragment conflict 
graph in (Lancia et al., 2001)]. A conflict graph has been conventionally 
defined as a non-weighted graph. Let us call it a binary conflict graph, 
which represents any pair of fragments with at least one mismatch in the 
variant matrix. For example, according to (Lancia et al., 2001), a conflict 
graph is a graph with an edge for each pair of fragments in conflict where 
two fragments are in conflict if they have different values in at least one 
column in the variant matrix X. There are a number of shortcomings with 
respect to using a binary conflict graph for haplotype assembly. The 
major problem with the conventional conflict graph is that it does not 
take into account the number of SNP sites for which the two fragments 
exhibit a mismatch. Two fragments are considered in conflict even if there 
is a mismatch at only one SNP site. In contrast, our fuzzy conflict 
graph aims to measure the amount of mismatch across all SNP 
sites of every pair of fragments. For example, consider the 

three fragments /, = {-- 000 }, ./8 = {--lll } and 

/lo = { 010 } in Figure 1. In a binary conflict graph, all the 

vertices are connected because there is at least one mismatch between 
every pair of fragments: three mismatches between f\ and /g, one 
mismatch between /i and fo and two mismatches between /g and /io. 
The binary conflict graph, however, treats all three edges equally. Our 
fuzzy conflict graph assigns weights of 1, 0.33 and 0.66 to these edges, 
respectively, to lead the partitioning algorithm to group f\ and f\o 
together. 

An example of a fuzzy conflict graph based on the fragments listed in 
Figure la is illustrated in Figure Ic. For visualization, the edges with a 
pivot distance are not shown. The problem of dividing the fragments into 
two most dissimilar groups is essentially a max-cut problem (Ausiello, 
1999). A max-cut partition may divide the fragments into Ci = {f\,f2,fi, 
fb,f\os and Cj = {/4, /s, ./v, /s, ./!)} as shown in Figure 1. We note that the 
resulting partition may not be unique in its general case. As we will dis- 
cuss in more details in Section 2.4, max-cut is an NP-hard problem, and 
existing techniques provide solutions that are highly suboptimal. Thus, 
we will leverage some properties of our fuzzy conflict graphs to develop a 
heuristic approach for fragment partitioning. 

2.4 Fragment partitioning 

As stated previously, our goal is to partition fragments into two disjoint 
sets such that fragments within each group are most similar and can form 
a haplotype with minimum MEC. Using the fuzzy conflict graph model 
presented in Section 2.3, a weighted max-cut algorithm needs to be used 
to find the optimal partition. The max-cut problem, however, is known to 
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be NP-hard even when all edge weights are set to one (Garey and 
Johnson, 1990). All edges in our fuzzy conflict graph have a positive 
weight. There exist heuristic algorithms (Sahni and Gonzales, 1974) 
that produce a cut with at least half of the total weight of the 
edges of the graph when all edges have a positive weight. In fact, a 
simple \ - approximate randomized algorithm is to choose a cut at 
random. This means that each edge e/ in the fuzzy conflict graph G is 
cut with a probability of |. Consequently, the expected weight of the edges 
crossing the cut, IV{C\,C2), is given by 

1 ^ 1 

W(CuC2)=-Y,'''i>^OPT (3) 

^ /-I ^ 

This algorithm can be derandomized to obtain a - approximate de- 
terministic algorithm. There exist two major shortcomings with this par- 
titioning algorithm: (i) Unfotunately, derandomization works well only 
on unweighted graphs where all edges have equal/unit weights. A similar 
approach for a weighted graph is not guaranteed to run in polynomial 
time; (ii) the obtained partition is highly suboptimal with an approxima- 
tion factor of i. Thus, we introduce a novel heuristic algorithm based on 
properties of fuzzy conflict graphs. 



Algorithm 2 FastHap partitioning algorithm 
Require: Fuzzy conflict graph G = ( V,E, We) 

Ensure: Partition P = [C\, C2] composed of two groups Ci and Cj of 
fragments 

(1) Delete edges with pivot weights from G 

(2) Sort remaining edges in G based on their weights 117 and store 
results in list D 

(3) Let ei = {fj,fii) be the edge with the largest weight in D 

(4) Initialize partition by assigning / and /j. to opposite groups 

(e.g. c, = {/a &C2 = m 

while (not all vertices are partitioned) do 

(5) Let ei = (fi,fk) be next edge with highest weight in D such that 
f.-eP or ft 6 P 

(6) Let fieP- if / eCi, then C2 = C2U{/i;), otherwise 

Ci = CiU{/,) 

(7) Let €,■ = {/], fk) be next edge with lowest weight in D such that 
,// 6 P or fk 6 P 

(8) Let ,/; 6 />, then if / 6 Ci, then C\=C\U{fk], otherwise 

C2 = C2U{/,) 

(9) If none of c/ and e,. exist, assign each remaining fragment to the 
more similar set 

end while 
repeat 

(10) Let MEC be the switching error for existing partition 

(11) Let fi be the fragment with largest switching error among all 
fragments in P 

(12) If fi e Ci (alternatively / 6 C2), move /j from to Cj (alterna- 
tively to Ci) 

(13) Let newMEC bet the switching error for the new partition 
until (iiewMEC > MEC) 



The FastHap partitioning algorithm is shown in Algorithm 2 and 
briefly explained as follows. First, the algorithm eliminates all edges 
with pivot weights from the fuzzy conflict graph G. Such edges do not 
contribute to formation of the final partition. The algorithm then sorts all 
edges of the graph (equivalently, pairs of the fragments) based on the 
edge weights and stores the results in D. An initial partition is formed by 
placing the two fragments with largest inter-fragment distance (associated 
with the heaviest edge in G) into two separate partition sets Ci and €'2- 
In the next phase, the algorithm alternates between the heaviest and 
lightest edges and assigns adjacent vertices (associated with fragments 



in X) to the existing partition if either of the vertices is already assigned 
to the partition. An edge with highest weight results in placing 
the adjacent vertices in different partitions and an edge with lowest 
weight attempts to assign the vertices to the same partition in 
P. This occurs only if the chosen edge is adjacent to an already parti- 
tioned edge. 

Theorem 1 . Algorithm 2 terminates in polynomial time. 

Proof. We prove that the algorithm terminates and its rumiing time 
is polynomial. Let M be the total number of edges in the given fuzzy 
conflict graph, respectively. During each iteration, the algorithm attempts 
to assign two edges (those with highest and lowest weights and adjacent 
to an already partitioned vertex) to the final partition P. Clearly, 
the iterative loop does not repeat more than M' times. In fact, during 
each iteration at least one edge (i.e. e/ or e,. or both) is selected to be added 
to the final partition. If the algorithm cannot find such an edge, all 
remaining edges are allocated to the final partition and the algorithm 
ends. Therefore, the iterations cannot repeat more than M' times and 
the algorithm will terminate after at most M' iterations. The proof 
regarding computing complexity of the algorithm is as follows. 
Removal of edges with pivot weight in (1) in Algorithm 2 can be 
completed in 0(M); the process of sorting the edges in instruction 
(2) can be done in 0(Mlog M); instruction (3) takes 0(1) to complete. 
The initialization of the partition in (4) can be done in 0(1); (5) detecting 
the edge with the highest weight and checking if one of its vertices 
(i.e. fragments) is already in the partition P require 0(1) and 0(M), 
respectively, to be completed; (6) assigning the selected edge to the 
partition require 0(1); similarly, instructions in (7) require 0(1) and 
0(M) to finish; similar to (6), the instructions in (8) can be done 
in 0(1). The instructions in (9) have a complexity of 0(M); reading 
the MEC value of the partition in (10) and (13) takes 0(1); 
instructions in (11) and (12) can finish in 0(M). Given that instructions 
in the loop are executed at most M times, the complexity of the algorithm 
is 0(Af). 



2.5 ReHnement phase 

The second loop in Algoritlim 2 shows second phase of the 
proposed haplotype reconstruction approach. The idea is to itera- 
tively find the fragment that contributes most to the MEC score 
and reassign it to the opposite partition. This process repeats as long as 
the MEC score improves. Our experimental results show that the first 
phase of the algorithm performs most of the optimization in terms of 
MEC improvements, leaving minimal improvements for the second 
phase. 

2.6 Fragment purging 

Because the complexity of FastHap is a function of the number of frag- 
ments in the variant matrix, it is reasonable to attempt to minimize the 
number of such fragments by eliminating any potential redundancy be- 
fore execution of the main algorithm. Therefore, FastHap uses a prepro- 
cessing phase during its initialization to combine those fragments that are 
highly similar. Fortunately, the inter-fragment distance measure provides 
a means to assess similarity between every two fragments. The criterion 
for combining two fragments /J and f^ is based on the inter-fragment 
distance \(fi,fk) and a given threshold a. The two fragments are 
merged if 

A(/;./i)<a (4) 

The purging process is straightforward. It eliminates the shorter frag- 
ments from the variant matrix. The value of a needs to be set based on 
the quality of data. For the dataset used in our experiments on different 
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chromosomes, we set a experimentally and found that a = 0.2 provides 
the best performance. 

3 VALIDATION 

3.1 Setup 

We used HuRef (Venterlnst, 2014), a publicly available dataset, 
to demonstrate the effectiveness of FastHap for individual haplo- 
type reconstruction. Our goal was to assess performance of 
FastHap in terms of both accuracy and speed in comparison 
with HapCut (Bansal and Bafna, 2008) and greedy algorithm 
in Levy et al. (2007). The main reason for choosing these two 
algorithms is that these algorithms have been historically popular 
in terms of accuracy and computing complexity. We ran all our 
experiments on a Linux x86 server computer. The server had 16 
CPU cores of 2.7 GHz with 16 GB of RAM. Each algorithm 
performed per-block haplotype reconstruction. Each block con- 
sisted of the reads that do not cross adjacent blocks. Although 
haplotype assembly solutions cannot do more than random 
guess between two consecutive variant site that do not share 
any fragments, our effort in this article was to provide technol- 
ogy that is appropriate for longer reads in each end of paired 
alignment and ample insertion size to minimize discoimection 
between different haplotype blocks. 

3.2 Dataset 

The HuRef dataset used for our analysis contains reads for all 22 
chromosomes of an individual, J.C. Venter. The data include 32 
million DNA short reads generated by Sanger sequencing 
method with 1.85 million genome- wide heterozygous sites. 
There are too many fairly short reads of ~I5bp (each end) 
while still tens of thousands of reads are long enough to cover 
more than two SNP sites and can be used for haplotype assembly 
purposes. In fact, many fragments within each block span several 
hundred SNP sites owing to the pair-end nature of the aligned 
reads. The variant matrix used for haplotype assembly was 
generated based on aligned short reads with paired-end 
method for each pair of various length (from 15 to 200 bp each 
end) while the insert length follows a normal distribution with a 
mean of 1000. 



Figure 2a shows read coverage for each chromosome. Read 
coverage numbers are calculated by taking an average over the 
coverage values of all SNP sites within each chromosome. The 
coverage vary from 6.49 reads for chromosorne 19 to 8.72 reads 
for chromosome 3. The average genome-wide coverage across all 
chromosomes is 7.43. Figure 2b shows distribution of the cover- 
age for chromosome 20 (exemplary), which includes 39 767 SNP 
sites. The coverage numbers range from 1 to 20 reads. Only two 
SNP sites had a coverage of 20. The average coverage for 
chromosome 20 was 6.83. 

Figure 3 shows several statistics on haplotype length of various 
chromosomes in HuRef dataset. Figure 3a shows chromosome- 
wide haplotype length, equivalent ally total number of SNP sites, 
for each chromosome. As mentioned previously, each chromo- 
some is divided into non-overlapping blocks. Haplotype length 
of such blocks may vary significantly from one chromosome to 
another. For example. Figure 3b shows distribution of haplotype 
length for a subset of chromosomes with 'small', 'medium' and 
'large' haplotypes. For instance, chromosome 8 has a number of 
blocks spanning >2500 SNPs. In contrast, haplotypes in chromo- 
some 18 barely exceed 1000 SNP sites. 

In addition to running FastHap on real HuRef data, we con- 
structed several simulated read matrices based on HuRef data 
(Bansal and Bafna, 2008). A simulated dataset based on real data 
allows us to assess performance of the proposed algorithm and 
extend its capabilities by changing various parameters (e.g. error 
rate, coverage and haplotype length or block width). To assess 
the accuracy of our method, we simulated a pair of chromosome 
copies based on real fragments and consensus SNP sites provided 
by HuRef data. The variant matrix for each chromosome 
on HuRef data was suitably modified to generate an 'error 
free' matrix at first. This was accomplished by modifying 
alleles in each fragment such that it perfectly matches a prede- 
fined haplotype. To introduce errors in the variant matrix, each 
variant call was flipped with a probability of e ranging from 0 to 
0.25. We also modified the variant matrix to produce vari- 
ant matrices of different coverage. Another change to the simu- 
late variant matrix was to generate blocks of varying haplotype 
length ranging from 200 to 1000 SNPs. Such variant 
matrices were then used to examine how performance of 
different algorithms (i.e. FastHap, Greedy, HapCut) changes 




Chromosomes 



(a) Coverage for each chromosome 




(b) Histogram of coverage for 
randomly selected chromosome 



Fig. 2. Coverage of HuRef dataset; (a) coverage for each chromosome; 
numbers vary from 6.49 to 8.72 for various chromosomes with an average 
genome-wide coverage of 7.43. (b) Histogram of coverage for chromo- 
some 20 as an example; Y-axis shows number of SNPs, with each specific 
coverage shown on x-axis 
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(a) Haplotype length per chromosome 



(b) Histogram of haplotype length for 
three exemplary chromosomes 



Fig. 3. Chromosome-wide haplotype length for each chromosome (a) 
and histogram of per-block haplotype length for chromosomes 8, 17 
and 18 as examples of chromosomes with 'small', 'medium' and 'large' 
blocks, respectively (b) 
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as a result of changes in error rate, coverage and haplotype 
length. 

3.3 Results 

Table 1 shows speed and accuracy results for all chromosomes 
on HuRef dataset. As it can be observed from the timing values, 
FastHap is significantly faster than both Greedy and HapCut. In 
particular, FastHap is up to 16.4 times faster than HapCut 
and up to 15.1 times faster than Greedy. The average speedup 
achieved by FastHap is 7.4 and 8.1 compared with Greedy and 
HapCut, respectively. In terms of accuracy performance, 
FastHap achieves 35.4 and 1.9% improvement in reducing 
switch error compared with Greedy and HapCut, respectively. 

A number of parameters affect speed performance of different 
algorithms. In particular, number of SNP sites within each vari- 
ant matrix is an important factor in many well-known algorithms 
such as HapCut. One advantage of FastHap is that its perform- 
ance is primarily influenced by the number of fragments in the 
variant matrix rather than the number of SNP sites. That is, a 



Table 1. Comparison of FastHap with Greedy and HapCut in terms of 
accuracy (MEC) and execution time using HuRef dataset. Best results in 
each column showed in bold 



Time Speedup using MEC (accuracy 

(min) FastHap performance) 



Chr Greedy HapCut FastHap versus versus Greedy HapCut FastHap 

Greedy HapCut 



1 


606 


880 


183 


3.3 


4.8 


29 657 


19 750 


19423 


2 


1001 


2446 


149 


6.7 


16.4 


22980 


14677 


14220 


3 


1809 


1053 


188 


9.6 


5.6 


16 878 


10738 


11794 


4 


542 


694 


63 


8.6 


11.0 


18 153 


11931 


11812 


5 


1381 


3229 


282 


4.9 


11.4 


16 590 


10630 


10 362 


6 


681 


750 


109 


62 


6.9 


15 587 


9992 


9870 


7 


456 


604 


76 


6.0 


7.9 


17402 


11290 


11245 


8 


5052 


4514 


334 


I5.I 


13.5 


14887 


9845 


10830 


9 


2006 


1747 


293 


6.8 


6.0 


13 812 


9318 


9204 


10 


667 


1445 


170 


3.9 


8.5 


15291 


9906 


9796 


11 


332 


288 


69 


4.8 


4.2 


12906 


8294 


8091 


12 


1303 


1638 


165 


7.9 


9.9 


12630 


8297 


7467 


13 


428 


761 


158 


2.7 


4.8 


9312 


6131 


6143 


14 


3315 


1919 


383 


8.7 


5.0 


9734 


6360 


5725 


15 


907 


1137 


208 


4.4 


5.5 


13 988 


9783 


9695 


16 


157 


248 


42 


3.7 


5.9 


12621 


8354 


8215 


17 


2223 


2790 


246 


9.0 


11.3 


11 157 


7398 


7386 


18 


698 


798 


87 


8.0 


9.2 


8578 


5043 


4846 


19 


309 


501 


41 


7.5 


12.2 


8214 


5497 


4886 


20 


326 


348 


32 


10.2 


10.9 


5752 


3784 


3437 


21 


482 


154 


48 


10.0 


3.2 


6611 


4715 


4707 


22 


535 


128 


39 


13.7 


3.3 


8295 


5864 


5875 


Overall 


25217 


28 074 


3365 


7.4 


8.1 


301 035 


197597 


195029 



Sum over all Average Sum over all 

chromosomes chromosomes 



Note. FastHap achieves speedups of 16.4 and 15.1 compared with HapCut and 
Greedy, respectively, is 1.9 and 35.4% more accurate than HapCut and Greedy, 
respectively. Statistics on coverage and haplotype length are shown in Figs 2 and 3 
and further discussed in Section 3.2. 



higher read coverage allows FastHap to generate better accuracy 
without significant impact on its running time. In contrast, as the 
haplotype length grows, HapCut algorithm runs very slowly 
compared with FastHap. As shown in Table 1, HapCut is very 
slow when applied to chromosome 8 primarily owing to the large 
haplotype length. This is also confimied through Figure 3b, 
which shows that chromosome 8 contains blocks that span 
>2500 SNPs. In contrast, chromosome 18, for exainple, can be 
reconstructed much faster when HapCut is used. Figure 3b 
shows that most of the blocks for chromosome 18 span <1000 
SNP sites. 

Using the simulated data described in Section 3.2, we ran 
FastHap on variant matrices of varying error rates and com- 
pared the reconstructed haplotypes with the true haplotypes. 
With this, we obtained the absolute accuracy results shown in 
Table 2. For brevity, results are shown only for 6 error rate 
values. The table shows how the absolute accuracy of the ob- 
tained haplotype is affected as a result of introduced errors. We 
observe that the accuracy numbers are always larger than what 
one may expect owing to the error rate. For example, when the 
error rate is 20%, one may expect an absolute accuracy of 80%, 
but the measured accuracy is 85.7%. This can be interpreted as 
follows. As the error rate (i.e. number of flipped variant calls) 
increases, some variant calls may become consistent with a dif- 
ferent haplotype of higher accuracy. 

Figure 4a shows the MEC score per variant call versus the 
simulated error rate obtained by each one of the three algo- 
rithms. The average MEC (normalized by number of variant 
calls) was 2.48, 2.56 and 2.86 for FastHap, HapCut and 
Greedy, respectively. The amount of improvement in MEC 
using FastHap was 13 and 2.8%o compared with Greedy and 
HapCut, respectively. Figure 4b shows the running time of the 



Table 2. Absolute accuracy of FastHap as a function of error rate 



Error rate (e in %) 0 5 10 15 20 25 

Accuracy on HuRef (%) 100 96.2 90.3 86.0 85.7 80.6 

Note. Results are obtained using variant matrix based on HuRef datasels. 




E (error rate %) Coverage 



(a) MEC Values (b) Running Time 

Fig. 4. Effect of error rate and coverage on performance of different 
algorithms. The analysis was performed on chromosome 20 (randomly 
selected) of HuRef dataset. (a) Switching error (MEC) of the three algo- 
rithms under comparison as a function of error rate; (b) execution time of 
the algorithms as a function of coverage 
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(a) Running Time 



(b) Speedup 



Fig. 5. Speed performance of the three algorithms as a function of haplo- 
type length. Analysis was performed on chromosome 20 (randomly se- 
lected) of HuRef dataset. (a) Execution time as a function of haplotype 
length, (b) Amount of speedup achieved by FastHap compared with 
Greedy and HapCut 



three algorithms as the coverage varies from 5 to 20. For this 
experiment, the variant matrix was carefully modified to obtain 
the right coverage needed for the analysis. Furthermore, the ob- 
tained matrix was first made 'error free'. We then flipped the 
variant calls with a probabihty of 6 = 0.25 for this analysis. 

To assess running time of different algorithms with respect to 
changes in haplotype length, variant matrices with different 
number of columns were built as explained previously in 
Section 3.2. Figure 5a shows execution time of the three algo- 
rithms as the partial haplotype length grows from 200 to 1000 
SNPs. For this analysis, an injected error rate of e = 0.25 was 
used. We note that the results are shown only for one block of 
data. It can be observed that the running time of HapCut in- 
creases significantly as the block width grows. That is, while 
HapCut can build a partial haplotype of length 200 in 25 s, its 
running time increases to 784 s when the length of the haplotype 
increases to 1000 SNPs. 

To demonstrate superiority of FastHap partitioning algorithm 
over a random partitioning, we selected a subset of the dataset at 
random. We ran both FastHap and random partitioning algo- 
rithms on the same variant matrix 10 times and calculated per- 
centage of improvements in MFC achieved by FastHap. The 
improvement numbers ranged from 12.17 to 31.64%, with an 
average improvement of 19.13%. 



4 DISCUSSION AND FUTURE WORK 

Development of efficient and scalable algorithms for haplotype 
assembly and reconstruction is by large an open research prob- 
lem. Presence of error and missing data in the DNA short reads 
makes the problem challenging. Current approaches suffer from 
limited accuracy and are not scalable for application on large 
datasets. In this article, we presented design, implementation and 
validation of FastHap, a highly scalable haplotype assembly and 
reconstruction method that has shown promising results 
compared with the state-of-the-art assembly techniques. We pre- 
sented a novel dissimilarity metric that quantifies inter-fragment 
distance based on the contribution of individual fragments in 
building a final haplotype. The notion of fuzzy conflict graph 
was proposed to model the haplotype reconstruction as a max- 
cut problem. We then introduced a fast heuristic algorithm for 
fragment partitioning based on the fuzzy conflict graphs. The 



framework lowers computing complexity of haplotype recon- 
struction dramatically while also outperforming accuracy per- 
formance of several popular assembly algorithms. In particular, 
FastHap is up to one order of magnitude faster than HapCut 
(Bansal and Bafna, 2008) and Levy's greedy approach (Levy 
er al., 2007). 

In this article, we compared FastHap with two well- 
known haplotype reconstruction algorithms, namely Levy's 
greedy algoritlmi and HapCut. The greedy algorithrn is his- 
torically known for its high speed while it also outperforms 
accuracy of other computationally simple and greedy al- 
gorithms such as FastHare (Panconesi and Sozio, 2004). 
HapCut, in contrast, is popular for its high accuracy, but 
demands much higher computational resources compared with 
Greedy. 

Because DNA short fragments are used in the process of 
haplotype assembly, the number of SNPs that each short read 
encompasses is considered to be an important factor. As a 
general rule, short reads that cover less than two SNP sites 
are ehminated in our analysis. When two or more variant 
positions are spanned by a single read, or occur on paired 
reads derived from the same shotgun clone, alleles can be 
linked to identify larger haplotypes. Current sequencing technol- 
ogies provide us with fragments that may or may not span 
multiple SNP sites. Although such reads do not link multiple 
SNPs, they can provide useful haplotype information for the 
SNP they cover. Our approach in this article does not require 
a preprocessing phase to eliminate such reads from further 
analysis. 

FastHap is a heuristic approach and may result in a subopti- 
nial solution. Yet, it can provide high-quality phasing of hetero- 
zygous variant sites. Unlike many prior works that use a 
randomly generate vector to seed the initial haplotype, the start- 
ing point of our algorithm is not a completely random pair of 
haplotypes but created using our intelligent distance measure. As 
demonstrated through our results, this approach would signifi- 
cantly improve the time complexity and accuracy of the obtained 
haplotypes. 

Given the promising speed results that we have achieved using 
FastHap, we are planning to further improve the accuracy of our 
algorithm. We believe that the algoiithm can become much 
smarter if a cross-optimization approach is applied where both 
fragment and SNP sets are considered for haplotype reconstruc- 
tion/refinement. 

In this article, we perfomed per-block analysis of speed and 
accuracy. As part of our future work, we plan to study how 
haplotypes generated from each block can be effectively com- 
bined to form genome-wide haplotypes. We also plan to study 
if the errors condensed by MFC values coincide when two haplo- 
type reconstruction algorithms are compared. 

With recent advancements in the sequencing technologies, 
access to long reads of more than few thousand bases is becorn- 
ing a reality (Huddleston et ah, 2014). For example. Pacific 
Biosciences^"^ released an extra-long set of DNA fragments 
with average read length of 8849 bp and up to 54Z coverage. 
The dataset has recently become publicly available 
(PacificBiosciences, 2014). This dataset, which contains single- 
end long fragments, is expected to be an excellent means to dem- 
onstrate huge speed/accuracy benefits that FastHap can provide. 
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Another example is newly released datasets based on 1000 
Genome project (Siva, 2008). These datasets are also large with 
high-density SNP sites. Our ongoing work involves application 
of FastHap on such datasets. 
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